About the project

In this project I’ll learn the basics of Markdown and Git/GitHub. Here’s a link to my IODS-project GitHub page.


Regression and model validation

Haven’t done this exercise :(


Logistic regression

Haven’t done this exercise ):


Clustering and classification

The topics of this chapter - clustering and classification - are handy and visual tools for exploring statistical data. Clustering means that some points (or observations) of the data are in some sense closer to each other than to other points. In other words, the data points do not comprise a homogeneous sample; instead, the sample is somehow clustered.

In general, clustering methods try to find these clusters (or groups) in the data. One of the most typical clustering methods is called k-means clustering. Hierarchical clustering methods are also quite popular, giving tree-like dendrograms as their main output.

As such, clusters are easy to find, but what might be the “right” number of clusters? It is not always clear. And how to give these clusters names and interpretations?

Based on a successful clustering, we may try to classify new observations to these clusters and hence validate the results of clustering. Another way is to use various forms of discriminant analysis, which operates with the (now) known clusters, asking: “what makes the difference(s) between these groups (clusters)?”
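
As a minimal sketch of the discriminant-analysis idea, the code below fits an LDA model to the Boston data used later in this chapter. The quartile-based crime classes and the 404/102 split follow the outputs shown further down; the seed value and the exact way the split is drawn are assumptions made only for illustration.

```r
library(MASS)  # provides lda() and the Boston data
data("Boston")

# Standardise all variables, then replace crim by a 4-level class variable
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
boston_scaled$crime <- crime

# 80 % of the rows for training (404), the rest for testing (102)
set.seed(123)  # assumed seed, only to make the sketch reproducible
n <- nrow(boston_scaled)
train_idx <- sample(n, size = floor(0.8 * n))
train <- boston_scaled[train_idx, ]
test  <- boston_scaled[-train_idx, ]

# Fit LDA on the training set and classify the held-out observations
lda_fit <- lda(crime ~ ., data = train)
pred <- predict(lda_fit, newdata = test)

# Cross-tabulate true against predicted classes
conf_tab <- table(correct = test$crime, predicted = pred$class)
print(conf_tab)
print(sum(diag(conf_tab)) / sum(conf_tab))  # overall accuracy
```

The resulting table will differ from the one reported under Point 6, because the split depends on the random draw.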

In connection with these methods, we also discuss the topic of distance (or dissimilarity or similarity) measures. There are many measures other than the ordinary Euclidean distance, although it is one of the most important ones. Several discrete and even binary measures exist and are widely used for different purposes in various disciplines.
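
To make the difference between distance measures concrete, here is a small sketch using base R's dist() on three made-up points. The data values are purely illustrative and do not come from the course material.

```r
# Three toy observations on two variables (made-up values)
x <- rbind(a = c(0, 0), b = c(3, 4), c = c(1, 0))

d_euclid    <- dist(x, method = "euclidean")  # straight-line distance
d_manhattan <- dist(x, method = "manhattan")  # sum of absolute differences
d_binary    <- dist(x, method = "binary")     # non-zero entries count as "on"

print(as.matrix(d_euclid))
print(as.matrix(d_manhattan))
print(as.matrix(d_binary))
```

For points a and b the Euclidean distance is 5 while the Manhattan distance is 7, so the choice of measure changes which observations count as close.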

Point 2

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506  14

Comments on point 2

The dataset contains 506 observations of 14 variables describing housing values and related factors, including the per-capita crime rate, in the Boston area. Descriptions of the dataset variables can be found here
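
The output above looks like what head(), str() and dim() print for the Boston data frame shipped with the MASS package. A minimal sketch, assuming that is how the data were loaded:

```r
library(MASS)   # the Boston data frame lives in this package
data("Boston")

head(Boston)    # first six rows
str(Boston)     # variable types and a preview of the values
dim(Boston)     # number of rows and columns
```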

Point 3

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Comments on point 3

Based on the scatter plot matrix, rm and lstat are moderately negatively correlated (r = -0.614), whereas rm and medv are positively correlated (r = 0.695). In addition, indus correlates rather strongly with tax and nox (r = 0.721 and r = 0.764, respectively). Definitions of some of the variables:

indus = proportion of non-retail business acres per town.

dis = weighted mean of distances to five Boston employment centres.

nox = nitrogen oxides concentration (parts per 10 million).

rm = average number of rooms per dwelling.

tax = full-value property-tax rate per $10,000.

lstat = lower status of the population (percent).

medv = median value of owner-occupied homes in $1000s.
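
The correlations quoted above can be checked directly from the correlation matrix. This is a short sketch, assuming the data come from MASS::Boston; the object name cor_matrix is only illustrative.

```r
library(MASS)
data("Boston")

# Pearson correlations between all pairs of variables
cor_matrix <- cor(Boston)

# Pick out the pairs discussed in the text
print(round(cor_matrix["rm", c("lstat", "medv")], 3))
print(round(cor_matrix["indus", c("tax", "nox")], 3))
```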

Point 4

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Comments on point 4

As can be seen from the summary above, scaling the data frame preserves the shape of each distribution; however, the means are now zero and the standard deviations one.
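
A quick way to verify the effect of standardisation is to compute the column means and standard deviations after scaling. The sketch below assumes scale() with its default arguments was used; the name boston_scaled is illustrative.

```r
library(MASS)
data("Boston")

# Subtract each column mean and divide by the column standard deviation
boston_scaled <- as.data.frame(scale(Boston))

print(round(colMeans(boston_scaled), 10))  # all (numerically) zero
print(apply(boston_scaled, 2, sd))         # all one
print(summary(boston_scaled$crim))         # same shape, new units
```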


Point 5

Point 6

## Confusion Matrix and Statistics
## 
##           predicted
## correct    low med_low med_high high
##   low       18       4        1    0
##   med_low    5      12       14    0
##   med_high   1       5       22    2
##   high       0       0        1   17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6765          
##                  95% CI : (0.5766, 0.7658)
##     No Information Rate : 0.3725          
##     P-Value [Acc > NIR] : 4.815e-10       
##                                           
##                   Kappa : 0.5639          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: low Class: med_low Class: med_high Class: high
## Sensitivity              0.7500         0.5714          0.5789      0.8947
## Specificity              0.9359         0.7654          0.8750      0.9880
## Pos Pred Value           0.7826         0.3871          0.7333      0.9444
## Neg Pred Value           0.9241         0.8732          0.7778      0.9762
## Prevalence               0.2353         0.2059          0.3725      0.1863
## Detection Rate           0.1765         0.1176          0.2157      0.1667
## Detection Prevalence     0.2255         0.3039          0.2941      0.1765
## Balanced Accuracy        0.8429         0.6684          0.7270      0.9413

Comments on point 6

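Overall, the classification with LDA seems to work quite well. For the test set shown above, the classification accuracy is 0.677 (95% CI 0.577 to 0.766), and the Kappa value of 0.56 indicates moderate agreement. Most errors occur between neighbouring classes: for example, 14 of the 31 med_low observations were predicted as med_high, whereas the high class is separated almost perfectly. Because the split into training and test sets is random, these figures can change from run to run unless a seed is set.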
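
The accuracy and Kappa values can be recomputed by hand from the confusion matrix printed above, which is a useful sanity check on the reported statistics. The sketch below re-enters the table manually; the variable names are illustrative.

```r
# Confusion matrix copied from the output above (rows: correct, cols: predicted)
classes <- c("low", "med_low", "med_high", "high")
conf_tab <- matrix(c(18,  4,  1,  0,
                      5, 12, 14,  0,
                      1,  5, 22,  2,
                      0,  0,  1, 17),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct = classes, predicted = classes))

n <- sum(conf_tab)

# Observed agreement: share of observations on the diagonal
accuracy <- sum(diag(conf_tab)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(conf_tab) * colSums(conf_tab)) / n^2

# Cohen's kappa: agreement beyond chance, relative to the maximum possible
kappa_hat <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, kappa = kappa_hat), 4))
# accuracy 0.6765, kappa 0.5639
```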

Point 7

Comments on point 7

Based on the plot above k=2 is the optimal number of clusters.
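
One common way to choose the number of clusters is to plot the total within-cluster sum of squares (TWCSS) against k and look for the point where the curve drops most sharply. The plot referred to above is not reproduced here, so the sketch below is an assumption about how such a plot could be made; the seed, k_max and nstart values are illustrative.

```r
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)

set.seed(123)  # k-means starts from random centres
k_max <- 10

# Total within-cluster sum of squares for k = 1, ..., k_max
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled, centers = k, nstart = 10)$tot.withinss
})

plot(1:k_max, twcss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

A large drop between k = 1 and k = 2 followed by a flatter curve would support the choice of two clusters.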

## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero


Comments on point 7 contd.

The most obvious discriminating variables, based on the plot above, are at least zn, nox, and tax. The blue group indicates cases where crime rates are higher, whereas in the red group the crime rates seem to be usually lower.

Bonus

Comments on Bonus

The strongest predictor variables are nox, tax, zn, and medv.

Super-Bonus

## [1] 404  13
## [1] 13  3

Comments on super-bonus

Clustering (k=4) is not a perfect match with the real crime categories, but it resembles them remarkably well.


Dimensionality reduction techniques — Data wrangling


Dimensionality reduction techniques — Analysis

point 1

## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The dataset human contains information on health, education, and governmental status collected from 155 countries. Short variable descriptions:

Edu2.FM: People with secondary education (female / male)

Labo.FM: Labour force participation rate (female / male)

Edu.Exp: Expected years of education

Life.Exp: Life expectancy at birth

GNI: Gross national income per capita

Mat.Mor: Maternal mortality ratio

Ado.Birth: Adolescent birth rate

Parli.F: Percent representation of women in parliament

For more detailed information please see: http://hdr.undp.org/sites/default/files/hdr2015_technical_notes.pdf

point 2

Brief comments on the data based on the plot above.

  • Life expectancy correlates strongly and positively with education, and strongly negatively with maternal mortality and adolescent birth rate.
  • Similarly, GNI seems to correlate well with education and life expectancy.
  • Adolescent birth rate and maternal mortality are strongly linked.
  • Interestingly, percent representation in parliament doesn’t seem to correlate strongly with any of the other variables.

point 3

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped


Without scaling, the PCA biplot above is dominated solely by GNI due to its large absolute values and variance.

point 4

After normalizing the variables the underlying effect of individual variables can be seen from the biplot above more clearly.
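
The effect of scaling on PCA can be illustrated with any dataset whose variables have very different variances. The human data are not bundled with R, so the sketch below uses the built-in USArrests data purely as a stand-in; it is not the analysis behind the biplots above.

```r
# Stand-in data: USArrests (built into R), NOT the human dataset
pca_raw    <- prcomp(USArrests)                 # variables in original units
pca_scaled <- prcomp(USArrests, scale. = TRUE)  # variables standardised first

# Share of the total variance captured by each principal component
var_raw    <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
var_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)
print(round(var_raw, 3))
print(round(var_scaled, 3))

# Variable with the largest loading on the unscaled PC1
print(names(which.max(abs(pca_raw$rotation[, "PC1"]))))

biplot(pca_raw, cex = c(0.6, 1))
biplot(pca_scaled, cex = c(0.6, 1))
```

In the unscaled fit the first component is driven almost entirely by the variable with the largest variance (Assault here, GNI in the human data), whereas after standardisation the variance is spread more evenly over the components.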

point 5

The first PC seems to capture nicely a transition from countries that have high life expectancy, high expected years of education, high female education, high GNI, low maternal mortality and low adolescent birth rate to those with the opposite values. The second PC seems to align with percent representation in parliament and labour force participation rate (female / male). As a note, the Nordic countries cluster in the top-left corner, i.e. at the high-development end of PC1 and at the high end of PC2.

point 6

## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## [1] 300  36
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 
## Warning: attributes are not identical across measure variables; they will
## be dropped

## 
## Call:
## MCA(X = tea, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.231   0.187   0.169   0.149   0.135   0.130
## % of var.             23.078  18.651  16.911  14.922  13.478  12.959
## Cumulative % of var.  23.078  41.729  58.641  73.563  87.041 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1             |  0.011  0.000  0.000 | -0.841  1.265  0.925 |  0.186
## 2             | -0.738  0.787  0.886 | -0.012  0.000  0.000 |  0.131
## 3             | -0.121  0.021  0.018 |  0.562  0.565  0.386 | -0.328
## 4             |  0.314  0.143  0.134 | -0.366  0.240  0.182 |  0.291
## 5             |  0.252  0.091  0.049 | -0.157  0.044  0.019 | -0.621
## 6             | -0.118  0.020  0.020 | -0.051  0.005  0.004 |  0.269
## 7             | -0.422  0.257  0.239 | -0.526  0.495  0.372 |  0.164
## 8             | -0.121  0.021  0.018 |  0.562  0.565  0.386 | -0.328
## 9             | -0.062  0.006  0.004 | -0.256  0.117  0.062 | -0.057
## 10            | -0.108  0.017  0.012 | -0.427  0.326  0.187 | -0.400
##                  ctr   cos2  
## 1              0.068  0.045 |
## 2              0.034  0.028 |
## 3              0.212  0.131 |
## 4              0.167  0.115 |
## 5              0.760  0.298 |
## 6              0.142  0.101 |
## 7              0.053  0.036 |
## 8              0.212  0.131 |
## 9              0.006  0.003 |
## 10             0.315  0.164 |
## 
## Categories (the 10 first)
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## home          |  -0.078   0.422   0.195  -7.636 |  -0.030   0.076   0.028
## Not.home      |   2.511  13.660   0.195   7.636 |   0.956   2.448   0.028
## evening       |   0.594   8.761   0.185   7.432 |   0.169   0.871   0.015
## Not.evening   |  -0.311   4.580   0.185  -7.432 |  -0.088   0.456   0.015
## breakfast     |  -0.455   7.163   0.191  -7.552 |  -0.640  17.574   0.378
## Not.breakfast |   0.420   6.612   0.191   7.552 |   0.591  16.222   0.378
## No.sugar      |  -0.603  13.557   0.388 -10.776 |   0.395   7.194   0.167
## sugar         |   0.644  14.492   0.388  10.776 |  -0.422   7.691   0.167
## F             |  -0.371   5.890   0.201  -7.744 |   0.542  15.564   0.428
## M             |   0.541   8.594   0.201   7.744 |  -0.791  22.709   0.428
##                v.test     Dim.3     ctr    cos2  v.test  
## home           -2.906 |  -0.124   1.468   0.497 -12.186 |
## Not.home        2.906 |   4.007  47.475   0.497  12.186 |
## evening         2.107 |  -0.914  28.264   0.437 -11.427 |
## Not.evening    -2.107 |   0.478  14.777   0.437  11.427 |
## breakfast     -10.634 |  -0.134   0.855   0.017  -2.234 |
## Not.breakfast  10.634 |   0.124   0.789   0.017   2.234 |
## No.sugar        7.057 |  -0.026   0.036   0.001  -0.472 |
## sugar          -7.057 |   0.028   0.038   0.001   0.472 |
## F              11.316 |  -0.033   0.064   0.002  -0.690 |
## M             -11.316 |   0.048   0.093   0.002   0.690 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## home          | 0.195 0.028 0.497 |
## evening       | 0.185 0.015 0.437 |
## breakfast     | 0.191 0.378 0.017 |
## sugar         | 0.388 0.167 0.001 |
## sex           | 0.201 0.428 0.002 |
## healthy       | 0.225 0.103 0.062 |

Comments on the MCA analysis:

  • Based on the plot above, it seems that males drink their tea with sugar more often than females.
  • For some reason the category Not.home is clearly separated from the other categories (perhaps because only very few respondents, 9 out of 300, do not drink tea at home).
  • Another interesting observation from the plot is that respondents in the healthy category tend to drink tea more often at breakfast, whereas those in the Not.healthy category drink it more often in the evening.